Protein Engineering, Design and Selection
Oxford University Press (OUP)
All preprints, ranked by how well they match Protein Engineering, Design and Selection's content profile, based on 14 papers previously published here. The average preprint has a 0.00% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.
Hsu, C.; Nisonoff, H.; Fannjiang, C.; Listgarten, J.
Predictive modelling of protein properties has become increasingly important to the field of machine-learning guided protein engineering. In one of the two existing approaches, evolutionarily-related sequences to a query protein drive the modelling process, without any property measurements from the laboratory. In the other, a set of protein variants of interest are assayed, and then a supervised regression model is estimated with the assay-labelled data. Although a handful of recent methods have shown promise in combining the evolutionary and supervised approaches, this hybrid problem has not been examined in depth, leaving it unclear how practitioners should proceed, and how method developers should build on existing work. Herein, we present a systematic assessment of methods for protein fitness prediction when evolutionary and assay-labelled data are available. We find that a simple baseline approach we introduce is competitive with and often outperforms more sophisticated methods. Moreover, our simple baseline is plug-and-play with a wide variety of established methods, and does not add any substantial computational burden. Our analysis highlights the importance of systematic evaluations and sufficient baselines.
Alcantar, M. A.; Paulk, A. M.; Moradi, S.; Bhar, D.; Keller, G. L. J.; Sanyal, T.; Bai, H.; Camdere, G.; Han, S. J.; Jain, M.; Jew, B.; Vatansever Inak, S.; Langmead, C. J.; Tinberg, C. E.; Chen, I.; Liu, C. C.
Computational protein design enables the generation of binders that target specific epitopes on proteins. However, current approaches often require substantial screening from which hits require further affinity maturation. Methods for experimentally improving designed proteins and exploring their sequence-affinity landscapes could therefore streamline the development of high-affinity binders and inform future design strategies. Here, we use OrthoRep, a system for continuous hypermutation in vivo, to drive the evolution of computationally designed mini protein binders ("minibinders") that target a mammalian receptor. Despite their small sizes (59-72 amino acids), we successfully affinity matured multiple minibinders through strong selection for improved binding and also sampled new regions of minibinder fitness landscapes through extensive neutral drift. One evolved minibinder variant was used to construct a combinatorially complete sequence-affinity map for its six affinity increasing mutations, which revealed nearly full additivity in their contributions to binding. Another minibinder was subjected to both deep mutational scanning and extensive evolution under weak selection, resulting in an evolutionarily diverged collection of binder sequences that revealed non-additive relationships among mutations. Our results highlight that the affinity of computationally designed binders can be rapidly increased through evolution and provide a scalable approach for the evolutionary exploration and subsequent mapping of sequence-affinity landscapes. We suggest that this work will complement protein binder design both as a reliable experimental optimization process and as a vehicle for generating new training data.
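As a rough illustration of what "combinatorially complete" and "additivity" mean for six mutations, the sketch below enumerates all 2^6 = 64 presence/absence combinations and computes an additive expectation for each. The per-mutation effect values and the function names are hypothetical, chosen only for illustration; they are not taken from the study.

```python
from itertools import product

# Hypothetical per-mutation effects on log10 affinity (illustrative values only,
# not taken from the study).
single_effects = [0.30, 0.25, 0.40, 0.10, 0.20, 0.35]


def all_genotypes(n_sites):
    """Every presence/absence combination of n_sites mutations (2**n_sites total)."""
    return list(product((0, 1), repeat=n_sites))


def additive_prediction(genotype, effects):
    """Sum the single-mutation effects for the mutations present in a genotype."""
    return sum(e for present, e in zip(genotype, effects) if present)


def epistasis(observed, genotype, effects):
    """Deviation of an observed value from the additive expectation."""
    return observed - additive_prediction(genotype, effects)


genotypes = all_genotypes(len(single_effects))
predictions = {g: additive_prediction(g, single_effects) for g in genotypes}
```

Under this toy model, a landscape is "nearly fully additive" when `epistasis` stays close to zero across the measured genotypes; larger deviations would indicate non-additive interactions between mutations.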
Hurley, J.; Shlosman, I.; Lakshminarayan, M.; Zhao, Z.; Yue, H.; Nowak, R.; Fischer, E. S.; Kruse, A.
Protein-peptide interactions underlie key biological processes and are commonly utilized in biomedical research and therapeutic discovery. It is often desirable to identify peptide sequence properties that confer high-affinity binding to a target protein. However, common approaches to such characterization are typically low throughput and only sample regions of sequence space near an initial hit. To overcome these challenges, we built a yeast surface displayed library representing ~6.1 × 10^9 unique peptides. We then performed screens against diverse protein targets, including two antibodies, an E3 ubiquitin ligase, and an essential membrane-bound bacterial enzyme. In each case, we observed motifs that appear to drive peptide binding and we identified multiple novel, high-affinity clones. These results highlight the library's utility as a robust and versatile tool for discovering peptide ligands and for characterizing protein-peptide binding interactions more generally. To enable further studies, we will make the library freely available upon request.
Li, N.; Vater, A.; Siegel, J. B.
Protein design is advancing toward quantitative modeling of enzyme function and stability. However, progress remains limited by the scarcity of standardized experimental datasets for training and benchmarking computational models. The Design to Data (D2D) program addresses this need by generating harmonized measurements of catalytic and stability parameters across an extensive β-glucosidase B (BglB) variant library. Here, we expand the D2D dataset with kinetic and thermal characterization of five single-point BglB variants and the wild-type (WT), including soluble expression, Michaelis-Menten constants (kcat, KM, and kcat/KM), and melting temperature (TM). Foldit Standalone was used to model the structural effects of the mutations. In this study, a weak but consistent association between Foldit total system energy (TSE) and TM was observed, suggesting local energetic effects that may influence stability. Together with the broader D2D corpus, these data enhance the functional mapping of BglB and provide model-ready benchmarks for developing and evaluating data-driven predictors of enzyme activity and stability.
Dallago, C.; Mou, J.; Johnston, K. E.; Wittmann, B.; Bhattacharya, N.; Goldman, S. L.; Madani, A.; Yang, K. K.
Machine learning could enable an unprecedented level of control in protein engineering for therapeutic and industrial applications. Critical to its use in designing proteins with desired properties, machine learning models must capture the protein sequence-function relationship, often termed fitness landscape. Existing benchmarks like CASP or CAFA assess structure and function predictions of proteins, respectively, yet they do not target metrics relevant for protein engineering. In this work, we introduce Fitness Landscape Inference for Proteins (FLIP), a benchmark for function prediction to encourage rapid scoring of representation learning for protein engineering. Our curated tasks, baselines, and metrics probe model generalization in settings relevant for protein engineering, e.g. low-resource and extrapolative. Currently, FLIP encompasses experimental data across adeno-associated virus stability for gene therapy, protein domain B1 stability and immunoglobulin binding, and thermostability from multiple protein families. In order to enable ease of use and future expansion to new tasks, all data are presented in a standard format. FLIP scripts and data are freely accessible at https://benchmark.protein.properties.
McWhirter, J. L.; Mukhopadhyay, A.; Farber, P.; Lakatos, G.; Dixit, S.
Functional biologics design is a multi-objective optimization problem, often with competing design objectives. We report on a novel deep learning based protein sequence prediction framework, ZymeSwapNet, that can be customized to handle a wide range of quantifiable design objectives, a current limitation of traditional protein design methods. We train a simple convolutional neural network (1D-CNN) on nonredundant curated protein crystal structures, using a set of geometric and topological features that describes a local protein environment, to predict the likelihood of each amino acid type for residue sites in the design region. While the model can be directly used to rank templates derived from mutagenesis campaigns, we extend the scope by developing a sequence/mutation generator that optimizes the desired multivariate distribution using Monte Carlo sampling. Using a case study - the design of a stable heterodimeric Fc (HetFc) antibody domain - we show that we can further include a Metropolis criterion to bias the sampling to enhance features such as the heterodimeric binding specificity, in addition to the original sampling objective of enhancing stability. We demonstrate that ZymeSwapNet can generate, within minutes, stable HetFc designs that had previously taken several rounds of rational structure and physical force-field based modeling attempts.
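For readers unfamiliar with the Metropolis criterion mentioned in this abstract, the following is a minimal, generic sketch of Metropolis-biased Monte Carlo sampling over sequences. It is not ZymeSwapNet's implementation: the mismatch-counting objective, the function names, and the parameter values are all hypothetical stand-ins for whatever learned scores the authors actually use.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def metropolis_accept(delta, temperature, rng):
    """Always accept improvements; accept worse moves with probability exp(-delta/T)."""
    if delta <= 0:
        return True
    return rng.random() < math.exp(-delta / temperature)


def toy_score(seq, target):
    """Hypothetical objective: mismatches to a target sequence (lower is better)."""
    return sum(a != b for a, b in zip(seq, target))


def sample(start, target, steps=2000, temperature=0.5, seed=0):
    """Propose single-site substitutions and keep the best sequence seen so far."""
    rng = random.Random(seed)
    current = list(start)
    current_score = toy_score(current, target)
    best, best_score = current[:], current_score
    for _ in range(steps):
        pos = rng.randrange(len(current))
        proposal = current[:]
        proposal[pos] = rng.choice(AMINO_ACIDS)
        new_score = toy_score(proposal, target)
        if metropolis_accept(new_score - current_score, temperature, rng):
            current, current_score = proposal, new_score
            if current_score < best_score:
                best, best_score = current[:], current_score
    return "".join(best), best_score
```

In a multi-objective setting such as the one described, the score difference passed to `metropolis_accept` could combine several terms (for example, stability and binding specificity); how those terms are weighted is a design choice not specified here.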
Clark-ElSayed, A.; Creed, E.; Nayvelt, K.; Ellington, A.
Recently, the number of ML-based tools for protein design has greatly expanded. Although there have been many successful uses of these tools for improved stability, solubility, and ligand binding, there have been fewer uses of these tools for designing proteins that have intrinsic allosteric mechanisms. In this regard, allosteric transcription factors (aTFs) are a class of regulatory proteins that includes repressors and activators that respond to environmental signals by allosteric communication to regulate their binding with DNA elements. The data exist for evaluating design algorithms for their ability to take allostery into account, as many aTFs have previously been engineered to respond to new ligands, enabling their use as biosensors. In particular, previous work from our lab used directed evolution to change the effector specificity of the transcriptional repressor, RamR, from cholic acids to each of five benzylisoquinoline alkaloids (BIAs). We wanted to see to what extent we could recapitulate these results by instead using LigandMPNN to design the ligand binding pocket. The wild-type RamR structure was predicted in complex with the five BIAs, and the binding pocket was then targeted for computational redesign. However, there was little overlap between the results of directed evolution and computational redesign, and in fact the nine redesigned protein variants tested proved not to be functional in Escherichia coli. Overall, these and other results suggest that different protein design methods may be needed to advance the computational design of allosteric or conformationally flexible proteins.
Stern, J.; Alharbi, S.; Sandholu, A.; Arold, S. T.; Della Corte, D.
A general method for designing proteins with high conformational specificity is desirable for a variety of applications, including enzyme design and drug target redesign. To assess the ability of algorithms to design for conformational specificity, we introduce MotifDiv, a benchmark dataset of 200 conformational specificity design challenges. We also introduce CSDesign, an algorithm for designing proteins with high preference for a target conformation over an alternate conformation. On the MotifDiv benchmark, CSDesign designs protein sequences that are predicted to prefer the target conformation. We apply this method in vitro to redesign human MAP kinase ERK2, an enzyme with active and inactive conformations. Out of two designs for the active conformation, one increased activity sufficiently to retain activity in the absence of activating phosphorylations, a property not present in the wild type protein.
Yao, Z.; Metts, J. M.; Huber, A. K.; Li, J.; Kinjo, T.; Dieckhaus, H.; Nallathambi, A.; Bowers, A.; Kuhlman, B.
Recent advances in machine learning (ML)-based protein design methods have enabled the rapid in silico generation of large libraries of miniprotein binders with minimal manual input. While computational design capacity has scaled rapidly, experimental validation methods have lagged, creating a bottleneck in binder discovery pipelines. Here, we apply mRNA display to screen an ML-designed miniprotein binder library and directly compare its performance with the more widely used yeast surface display platform using a single shared DNA library. We screened 2,009 designs targeting the platelet receptor TLT-1 and 3,159 designs targeting the immune receptor B7-H3 across both platforms. While both selection methods reliably identified functional binders, we found that mRNA display preferentially enriched binders with slower dissociation rates. In addition, mRNA display achieved higher library coverage than yeast display, likely rescuing functional designs that are penalized in a cell-based expression system. Biophysical characterization of selected binders from both platforms revealed strong binding affinities and high thermal stabilities. These results showcase the power of integrating ML-based computational design tools with rapid in vitro selection technologies, providing a scalable framework for therapeutic miniprotein discovery.
Importance: Miniprotein binders offer major advantages as next-generation therapeutics, including small size, high stability, and efficient production. In this work, we conduct a side-by-side comparison of mRNA and yeast display as platforms for high-throughput evaluation of de novo miniprotein binders. The binders generated here serve as starting points for therapeutics targeting TLT-1 or B7-H3, two clinically relevant molecules.
Cotet, T.-S.; Krawczuk, I.; Pacesa, M.; Nickel, L.; Correia, B. E.; Haas, N.; Qamar, A.; Challacombe, C. A.; Kidger, P.; Ferragu, C.; Naka, A.; Castorina, L. V.; Subr, K.; Kluonis, T.; Stam, M. J.; Unal, S. M.; Wood, C. W.; Stocco, F.; Ferruz, N.; Kurumida, Y.; Calia, C. N.; Paesani, F.; Machado, L. d. A.; Belot, E.; Gitter, A.; Campbell, M. J.; Hallee, L.; Adaptyv Competition Organizers,
In this report, we summarize and analyze the 2024 Adaptyv protein design competition. Participants used computational and Machine Learning (ML) methods of their choice to design proteins that bind the Epidermal Growth Factor Receptor (EGFR), a key drug target involved in cell growth, differentiation, and cancer development. Over 1,800 designs were submitted across two rounds. Of these, 601 proteins were selected and characterized for expression and binding affinity to EGFR, with competitors both optimizing existing binders (KD = 1.21 nM) and creating de novo binders (KD = 82 nM). All selected designs were experimentally validated using Adaptyv's automated Bio-Layer Interferometry (BLI) pipeline. This competition illustrates the potential of crowdsourcing to drive creativity and innovation in protein design. However, it also exposed key challenges, such as the lack of standardized benchmarks, experimental design targets, and robust computational metrics for method comparison. We anticipate that future competitions will address these gaps and further motivate progress in computational protein design.
Godin, R.; Hejazi, S. S.; Lange, B.; Aldamak, B.; Reuel, N. F.
Active learning-guided protein engineering efficiently navigates the challenging fitness landscape by screening designs iteratively in a model-guided design-build-test-learn cycle. However, while high iterations boost performance, current workflows' reliance on tedious and costly cell-based cloning and expression steps limits the iterations they can practically implement. To address this problem, we present a novel combinatorial mutagenesis workflow that uses small (~20-40 bp) mutagenic annealed-oligo fragments and cell-free expression to rapidly and conveniently screen protein variants in <9 hours. Using bulk-prepared mutagenic oligos eliminates the need for cloning, PCR-based mutagenesis, or ordering costly genes each screening round. Their >80% size reduction from current fragment-based shuffling strategies also helps avoid including multiple mutations on the same fragment, reducing the number one must order to cover the design space. By screening 3-10 fragment assemblies for two different proteins, we show our approach is a general, scalable, and cost-effective platform for high-iteration protein engineering.
Notin, P.; Weitzman, R.; Marks, D. S.; Gal, Y.
Protein design holds immense potential for optimizing naturally occurring proteins, with broad applications in drug discovery, material design, and sustainability. However, computational methods for protein engineering are confronted with significant challenges, such as an expansive design space, sparse functional regions, and a scarcity of available labels. These issues are further exacerbated in practice by the fact that most real-life design scenarios necessitate the simultaneous optimization of multiple properties. In this work, we introduce ProteinNPT, a non-parametric transformer variant tailored to protein sequences and particularly suited to label-scarce and multi-task learning settings. We first focus on the supervised fitness prediction setting and develop several cross-validation schemes which support robust performance assessment. We subsequently reimplement prior top-performing baselines, introduce several extensions of these baselines by integrating diverse branches of the protein engineering literature, and demonstrate that ProteinNPT consistently outperforms all of them across a diverse set of protein property prediction tasks. Finally, we demonstrate the value of our approach for iterative protein design across extensive in silico Bayesian optimization and conditional sampling experiments.
Jiang, B.; Li, X.; Guo, A.; Wei, M.; Wu, J.
The design of high-affinity protein binders is critical for biochemical detection, yet traditional methods remain labor-intensive. AI-driven tools like RFdiffusion, a RoseTTAFold-based diffusion model, offer promising alternatives for generating protein structures with tailored binding interfaces. This study evaluates RFdiffusion's efficacy in designing de novo binders for six targets: Strep-Tag II (a peptide tag) and five eukaryotic proteins (STAT3, FGF4, EGF, PDGF-BB, and CD4). Five binders were designed for each target and experimentally validated. While two Strep-Tag II binders outperformed streptavidin in Western blot assays, none matched the sensitivity of anti-Strep-Tag II antibodies. Binders for the other targets failed due to low expression, nonspecific binding, or undetectable affinity. Despite generating structurally diverse candidates, RFdiffusion's success rate was limited by low-affinity designs and inconsistent recombinant expression. These results underscore the need for further optimization of AI-driven protein design tools for practical biochemical applications.
Shuai, R. W.; Lu, T.; Bhatti, S.; Kouba, P.; Huang, P.
Structure-conditioned sequence design models aim to design a protein sequence that will fold into a given target structure. Deep-learning-based approaches for sequence design have proven highly successful for various protein design applications, but many non-idealized backbones still remain out of reach for current models under typical in silico success criteria. We hypothesize that training objectives prioritizing native sequence recovery unintentionally push models to reproduce non-structural signals (e.g. phylogenetic relatedness, neutral drift, or dataset sampling biases), rather than a broadly generalizable structure-sequence mapping. Inspired by recent work bridging sequence likelihood and fitness prediction in protein language models, we introduce Caliby, a Potts model-based sequence design method capable of conditioning on an ensemble of structures. Conditioning on a synthetic ensemble generated from an input backbone allows sampling of sequences consistent with the structural constraints of the ensemble while averaging out undesired biases towards the native sequence. Ensemble-conditioned sequence design with Caliby reduces native sequence recovery while substantially improving AlphaFold2 self-consistency, outperforming state-of-the-art models ProteinMPNN and ChromaDesign on both native and de novo backbones. Finally, we train a variant of Caliby on only soluble proteins and demonstrate in silico that Protpardelle-1c binder designs that were previously deemed undesignable by SolubleMPNN are actually designable under SolubleCaliby, highlighting limitations of existing filtering pipelines. These results suggest that Caliby can expand the de novo design space beyond highly idealized backbones.
Didi, K.; Alamdari, S.; Lu, A. X.; Wittmann, B.; Johnston, K. E.; Amini, A. P.; Madani, A. K.; Czeneszew, M.; Dallago, C.; Yang, K. K.
Machine learning methods that predict protein fitness from sequence remain sensitive to changes in data distributions, limiting generalization across common conditions encountered in protein engineering. Practically, protein engineers are thus left wondering about the effective utility of ML tools. The FLIP benchmark established protocols for testing generalization under some domain shifts, but it was limited to measurements of thermostability, binding, and viral capsid viability. We introduce FLIP2, a protein fitness benchmark spanning seven new datasets, including enzymes, protein-protein interactions, and light-sensitive proteins, as well as splits that measure generalization relevant to real-world protein engineering campaigns. Evaluating a suite of benchmark models across these datasets and suites reveals that simpler models often matched or outperformed fine-tuned protein language models on FLIP2, challenging the utility of existing transfer learning techniques. Provenance for all datasets has been recorded and we redistribute all data CC-BY 4.0 to facilitate continued progress.
Bentham, A. R.; Youles, M.; Mendel, M. N.; Varden, F. A.; De la Concepcion, J. C.; Banfield, M. J.
The ability to recombinantly produce target proteins is essential to many biochemical, structural, and biophysical assays that allow for interrogation of molecular mechanisms behind protein function. Purification and solubility tags are routinely used to maximise the yield and ease of protein expression and purification from E. coli. A major hurdle in high-throughput protein expression trials is the cloning required to produce multiple constructs with different solubility tags. Here we report a modification of the well-established pOPIN expression vector suite to be compatible with modular cloning via Type IIS restriction enzymes. This allows users to rapidly generate multiple constructs with any desired tag, introducing modularity in the system and delivering compatibility with other modular cloning vector systems, for example streamlining the process of moving between expression hosts. We demonstrate these constructs maintain the expression capability of the original pOPIN vector suite and can also be used to efficiently express and purify protein complexes, making these vectors an excellent resource for high-throughput protein expression trials.
Highlights:
- pOPIN-GG expression vectors allow for modular cloning, enabling rapid screening of purification and solubility tags at no loss of expression compared to previous vectors.
- Cloning into the pOPIN-GG vectors can be performed from PCR products or from level 0 vectors containing the required parts.
- Several vectors with different resistances and origins of replication have been generated, allowing the effective co-expression and purification of protein complexes.
- All pOPIN-GG vectors generated here are available on Addgene, as well as level 0 acceptors and tags.
Ma, E.; Kummer, A.
UniRep is a recurrent neural network model trained on 24 million protein sequences, and has shown utility in protein engineering. The original model, however, has rough spots in its implementation, and a convenient API is not available for certain tasks. To rectify this, we reimplemented the model in JAX/NumPy, achieving near-100X speedups in forward pass performance, and implemented a convenient API for specialized tasks. In this article, we wish to document our model reimplementation process with the goal of educating others interested in learning how to dissect a deep learning model, and engineer it for robustness and ease of use.
Rao, R. M.; Meier, J.; Sercu, T.; Ovchinnikov, S.; Rives, A.
Unsupervised contact prediction is central to uncovering physical, structural, and functional constraints for protein structure determination and design. For decades, the predominant approach has been to infer evolutionary constraints from a set of related sequences. In the past year, protein language models have emerged as a potential alternative, but performance has fallen short of state-of-the-art approaches in bioinformatics. In this paper we demonstrate that Transformer attention maps learn contacts from the unsupervised language modeling objective. We find the highest capacity models that have been trained to date already outperform a state-of-the-art unsupervised contact prediction pipeline, suggesting these pipelines can be replaced with a single forward pass of an end-to-end model.
Rajendran, S.; Kottaiyl, I.; Webster, L.; Vavilala, D.; Hunter, M.; Konar, M.; Karunakaran, S.; Pereira, M.; Johnson, J.; Minshull, J.; Boldog, F.
Bispecific antibodies are at the forefront of biopharmaceutical drug development. With over 100 different molecular architectures combined with diverse individual subunit sequences, choosing the most suitable structure and predicting the ideal subunit expression ratios for successful heterodimerization is a significant challenge. In this paper, we demonstrate that the recently described cell line development paradigm shift (Rajendran et al. 2021), enabled by the Leap-In transposon platform, can be extended to the development of bispecific monoclonal antibody-producing cell substrates (stable clones and pools). The key features are: 1) Parental pools reliably predict the derivative clonal productivity and clonal heterodimer fractions. 2) Clonal productivity and clonal heterodimer fraction remained stable for at least 60 population doublings. 3) Depending on the product's biophysicochemical properties, the stable pools exhibit variable productivity stability. 4) Heterodimer fractions remain stable in the Leap-In mediated stable pools independently of the productivity stability of the pools. 5) Structures and subunit ratios can be triaged at stable pool level. 6) Due to the homogeneous clonal productivity distribution, only a small number (~50) of clones need to be isolated and characterized.
Maduros, A.; Farinsky, L.; Tagkopoulos, P.; Vater, A.; Siegel, J. B.
This study explores computational design predictions related to experimental enzyme behavior by analyzing seven single-point mutants of β-glucosidase B (BglB) from Paenibacillus polymyxa: Y333F, A88E, L219Q, A408H, Y173L, E340S, and Y422F. Each mutation was modeled using Foldit Standalone, and mutant selections were based on predicted thermodynamic stability changes of interest. Six of the seven mutants in this set yielded soluble, expressed protein. Most variants had similar catalytic efficiency compared to the wild type with one exception. The melting temperatures for most variants were also similar to the wild type. Correlation analysis revealed weak but potentially informative relationships between predicted ΔTSE and (a) thermal stability and (b) catalytic efficiency. These results further support known limitations of TSE score as a tool for single point mutation design and add to a growing dataset being generated to build the next generation of functionally predictive protein models.